Contributor

@lennartkats-db lennartkats-db commented Dec 19, 2025

This adds an example for defining a User-Defined Table Function (UDTF) in Unity Catalog.

Highlighted files:

@lennartkats-db lennartkats-db changed the title [DRAFT] Add UDTF example Add UDTF example Dec 19, 2025
@lennartkats-db lennartkats-db changed the title Add UDTF example Add a User-Defined Table Function example Dec 19, 2025
- Add k-means clustering UDTF example for Unity Catalog
- Focus documentation on UDTF pattern and SQL accessibility
- Include Python implementation and SQL usage examples
- Add CI/CD integration instructions
import csv
import os
except ImportError:
    raise ImportError(

It's not great for debugging to lose the original stack trace. Why not add the message but keep the stack trace?

try:
  ...
except ImportError:
  print(..., file=sys.stderr)
  raise

Contributor Author

This is from the default-python template, though :o The idea is to give humans a very short message, without a long stack trace, when they type pytest instead of uv run pytest.
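The reviewer's suggestion can be sketched as follows. This is a minimal illustration, not code from the PR: the helper name `import_or_hint` and the hint text are hypothetical. A bare `raise` inside the `except` block re-raises the active exception unchanged, so the short hint goes to stderr and the original traceback is still available.

```python
import importlib
import sys

# Hypothetical hint text, for illustration only.
HINT = "Dependencies not found. Try running: uv run pytest"


def import_or_hint(module_name: str):
    """Import a module; on failure print a short hint, then re-raise unchanged."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        print(HINT, file=sys.stderr)
        raise  # bare raise keeps the original exception and its traceback
```

The trade-off the author describes still applies: with this pattern pytest prints the full traceback after the hint, which is exactly what the short-message approach tries to avoid.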

register_udtf_job:
  name: register_udtf_job
  schedule:
    quartz_cron_expression: '0 0 8 * * ?' # daily at 8am

Q: why do you need to register the same UDTF daily?

Related Q: how do you unregister? Is it just a matter of removing this job, or is there some cleanup process?

Contributor Author

@lennartkats-db Jan 15, 2026

Daily register: It would be much better if we had a "job run" resource that could run the job post-deploy. Running it daily is a compromise. The README talks more about this: customers can leave this compromise in place, or they can extend their CI scripts to run the job after deploy.

Unregister: maybe you have ideas on how to do that with a "job run" resource in place? Maybe a parameter to the job, so that it runs before the job is destroyed? Could that work?
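The CI alternative mentioned here, running the job right after deploy, could look roughly like the sketch below. It assumes the Databricks CLI is installed and authenticated in the CI environment, that the bundle's job key is `register_udtf_job` as in the snippet under review, and that the caller supplies a target name. The helper names are hypothetical and not part of the PR.

```python
import subprocess


def post_deploy_commands(target: str, job_key: str = "register_udtf_job") -> list[list[str]]:
    """Return the CLI invocations: deploy the bundle, then run the registration job once."""
    return [
        ["databricks", "bundle", "deploy", "--target", target],
        ["databricks", "bundle", "run", "--target", target, job_key],
    ]


def deploy_and_register(target: str) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for command in post_deploy_commands(target):
        subprocess.run(command, check=True)
```

With a step like this in CI, the daily schedule would only serve as a fallback and could be dropped from the job definition.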

if not self.rows:
    return

import numpy as np

Could you add a comment explaining why the imports are here and not at the top level? Is it so that registration works without these libraries?

Contributor Author

Actually this is not essential, let me change this, good callout.

Contributor Author

No, I was wrong. The agent added a comment after I gave it this prompt:

hey in #138 there is a comment about inline imports. can you
look at that comment. fix the issue. then verify things still work: the tests would need to run, the job would
need to be deployable + successfully run end-to-end. ultrathink

if you can do all this. then rate how well you did at making sure everything works well. if it's >8 out of 10,
then commit and push!
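The thread does not spell out why the imports are inline. One plausible reason, assuming registration ships only the text returned by `inspect.getsource(SklearnKMeans)`, is that the source of a class does not include module-level imports, so anything the handler needs has to be imported inside the class itself. The sketch below demonstrates that behavior with a hypothetical `Handler` class written to a temporary module; none of these names come from the PR.

```python
import importlib.util
import inspect
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module: one import at module level, one inside a method.
MODULE_SOURCE = textwrap.dedent('''
    import math

    class Handler:
        def eval(self, x):
            import statistics
            return statistics.mean([x, math.sqrt(x)])
''')


def class_source_from_file() -> str:
    """Load the demo module from disk and return the source of Handler only."""
    name = "demo_udtf_module"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"{name}.py"
        path.write_text(MODULE_SOURCE)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module  # inspect looks the class's module up here
        try:
            spec.loader.exec_module(module)
            return inspect.getsource(module.Handler)
        finally:
            sys.modules.pop(name, None)
```

The returned text starts at `class Handler:` and contains `import statistics` but not `import math`, so a handler relying on the module-level import would fail with a `NameError` once its source is executed elsewhere.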

def register(catalog: str, schema: str, name: str = "k_means"):
    """Register k_means UDTF in Unity Catalog"""
    spark = SparkSession.builder.getOrCreate()
    source = inspect.getsource(SklearnKMeans)

How does this function get access to its dependencies? Does it rely on sklearn being pre-provisioned?

Contributor Author

Right now, yes. There is apparently a gap in how UDTFs specify dependencies. That gap will be fixed.
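To make the dependency question concrete: the rest of `register` is not shown in this thread, but a registration of this kind plausibly wraps the class source in a `CREATE FUNCTION` statement. The sketch below is an assumption about that shape, based on the Unity Catalog Python UDTF syntax (`RETURNS TABLE`, `LANGUAGE PYTHON`, `HANDLER`). The helper name, argument signature, and return columns are hypothetical. Nothing in the statement declares packages, which is why the handler's inline imports must already resolve in the environment where the function runs.

```python
def build_create_udtf_sql(
    catalog: str,
    schema: str,
    name: str,
    handler: str,
    signature: str,
    returns: str,
    source: str,
) -> str:
    """Assemble a CREATE FUNCTION statement that embeds the handler class source."""
    if "$$" in source:
        # The body is dollar-quoted, so the source must not contain the delimiter.
        raise ValueError("handler source must not contain '$$'")
    return (
        f"CREATE OR REPLACE FUNCTION {catalog}.{schema}.{name}({signature})\n"
        f"RETURNS TABLE ({returns})\n"
        "LANGUAGE PYTHON\n"
        f"HANDLER '{handler}'\n"
        f"AS $$\n{source}\n$$"
    )
```

In a `register` function like the one under review, the resulting string would be passed to `spark.sql(...)`, with `source` coming from `inspect.getsource(SklearnKMeans)`.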

@lennartkats-db lennartkats-db requested a review from denik January 16, 2026 08:15

3 participants